Experiment with searching for SO terms

Background

A variant annotation record has a json structure like the following:

{u'createDateTime': u'2015-11-18T00:00:00Z',
 u'id': u'YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM6MTo2NDExNTo4NzFkNGM4OWE1Mzc0NjQwNjA2NDM0OTkzYWVmNGFmZQ',
 u'info': {},
 u'transcriptEffects': [{u'CDSLocation': None,
   u'alternateBases': u'A',
   u'analysisResults': [],
   u'cDNALocation': None,
   u'effects': [{u'id': u'SO:0001631',
     u'sourceName': None,
     u'sourceVersion': None,
     u'term': u'upstream_gene_variant'}],
   u'featureId': u'NM_001005484.1',
   u'hgvsAnnotation': {u'genomic': u'1:g.64116C>A',
    u'protein': u'',
    u'transcript': u'NM_001005484.1:c.-4975C>A'},
   u'id': u'2053be57055a40663aa02b2cdc9c7351',
   u'proteinLocation': None},
  {u'CDSLocation': None,
   u'alternateBases': u'A',
   u'analysisResults': [],
   u'cDNALocation': None,
   u'effects': [{u'id': u'SO:0000605',
     u'sourceName': None,
     u'sourceVersion': None,
     u'term': u'intergenic_region'}],
   u'featureId': u'FAM138A-OR4F5',
   u'hgvsAnnotation': {u'genomic': u'1:g.64116C>A',
    u'protein': u'',
    u'transcript': u'n.64116C>A'},
   u'id': u'6e6a547b0bdb446a78a3819bfcd6e06c',
   u'proteinLocation': None}],
 u'variantAnnotationSetId': u'YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM',
 u'variantId': u'YnJjYTE6T1I0RjoxOjY0MTE1OmU4Y2MyOTg2MGJmOTJjZGVmOTEwY2IyMzllYWVkZDI0'}

That is: variant annotation —⪪ transcript effects —⪪ effects

In the sample data, there are many variants with multiple transcript effects, but all transcriptEffects have exactly one effect (see below for data).

The Question

searchVariantAnnotations accepts an list of json-formatted array of effect filters. It is unclear (to me) how this should behave when multiple filters are provided. Specifically, given a set F of SO ids provided as a filter filtering (e.g., {SO:1,SO:2}), and a set S of SO ids associated with all transcriptEffects of a variant annotation, does a variant annotation VA with S "match" F if:

* F ⋂ S ≠ {}  -- at least one f ∈ F is in S
* F ⊂ S       -- all f ∈ F are also in S
* S ⊂ F       -- all s ∈ S are also in F
* F = S       -- sets are identical (⇒ all of the above are true)

It's hard to know what we want without use cases. However, it seems clear that users are likely to have one of two expectations:

  • SO filter terms are ANDed; that is, a VA matches if all f ∈ F are in S (i.e., F ⊆ S)
  • SO filter terms are ORed; that is, a VA matches if any f ∈ F is in S

Let's test.

Setup


In [1]:
import itertools
import pandas as pd
from pivottablejs import pivot_ui

import ga4gh.client
print(ga4gh.__version__)

gc = ga4gh.client.HttpClient("http://localhost:8000")


0.1.dev632+ncb43455c1003

In [2]:
region_constraints = dict(referenceName="1", start=0, end=int(1e10))
variant_set_id = 'YnJjYTE6T1I0Rg'
variant_annotation_sets = list(gc.searchVariantAnnotationSets(variant_set_id))
variant_annotation_set = variant_annotation_sets[0]
print("Using first variant annotation set (of {n} total) for variant set {vs_id}\nvas_id={vas.id}".format(
    n=len(variant_annotation_sets), vs_id=variant_set_id, vas=variant_annotation_set))


Using first variant annotation set (of 1 total) for variant set YnJjYTE6T1I0Rg
vas_id=YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM

Characterizing sample data


In [3]:
variant_annotations = [{
        'va':va,
        'n_te': len(list(va.transcriptEffects)),
        'n_ef': len(list(ef for te in va.transcriptEffects for ef in te.effects)),
        'sos': ";".join(sorted(set("{ef.id}:{ef.term}".format(ef=ef)
                                   for te in va.transcriptEffects
                                   for ef in te.effects)))
    }
    for va in gc.searchVariantAnnotations(variant_annotation_set.id, **region_constraints)
    ]
variant_annotations_df = pd.DataFrame(variant_annotations)

The following is an inline graphic image. See instructions below it for reproducing it.

To regenerate this data:

  1. Eval the next cell
  2. Select Bar Chart from Table menu
  3. Drag-drop "sos" to left column under Count pulldown
  4. Drag-drop n_te, then n_ef to row to right of Count pulldown

In [21]:
pivot_ui(variant_annotations_df)


Out[21]:

The searches

Using the data above, we can search for single and multiple terms and compare to expectations.

We'll be using this function:

Signature: gc.searchVariantAnnotations(variantAnnotationSetId, referenceName=None, referenceId=None, 
                                       start=None, end=None, featureIds=[], effects=[])
Docstring:
Returns an iterator over the Annotations fulfilling the specified conditions from the specified
AnnotationSet.

The JSON string for an effect term must be specified on the command line : 
`--effects '{"term": "exon_variant"}'`.

In [14]:
def _mk_effect_filter(so_ids=[]):
    """return list of so_id effect filters for the given list of so_ids

    >>> print(_mk_effect_filter(so_ids="SO:1 SO:2 SO:3".split()))
    ['{"id":"SO:1"}', '{"id":"SO:2"}', '{"id":"SO:3"}']
    """
    return [{"id": so_id} for so_id in so_ids]

def _fetch_variant_annotations(gc, so_ids=[], **args):
    return gc.searchVariantAnnotations(variant_annotation_set.id,
                                       effects=_mk_effect_filter(so_ids),
                                       **args)

In [19]:
# expected:
#so_terms
#SO:0000605:intergenic_region                                       697
#SO:0000605:intergenic_region;SO:0001631:upstream_gene_variant       63
#SO:0000605:intergenic_region;SO:0001632:downstream_gene_variant     56
#SO:0001583:missense_variant                                         16
#SO:0001587:stop_gained                                               1
#SO:0001819:synonymous_variant                                        7
        
[(so_set,
  len(list(_fetch_variant_annotations(gc, so_ids=so_set.split(), **region_constraints))))
 for so_set in ["SO:0001819", "SO:0001632", "SO:0000605", 
                "SO:0000605 SO:0001632", "SO:0001632 SO:0000605",
                "SO:9999999", "SO:0000605 SO:999999"]
 ]


Out[19]:
[('SO:0001819', 7),
 ('SO:0001632', 56),
 ('SO:0000605', 816),
 ('SO:0000605 SO:0001632', 816),
 ('SO:0001632 SO:0000605', 816),
 ('SO:9999999', 0),
 ('SO:0000605 SO:999999', 816)]

Conclusion

Searching uses disjunctive OR. That is, searching with a filter containing multiple terms returns the union of annotations that match at least one term. That's good because it means that conjuction (AND) may be applied on the return set.


In [ ]: